# Qwen2-Audio-7B-DPO-CodeSwitch

A LoRA adapter for Qwen/Qwen2-Audio-7B-Instruct, fine-tuned with DPO (Direct Preference Optimization) on code-switching speech transcription data.
## Evaluation Results

All scores are MER (Mixed Error Rate); lower is better. The Improvement column is the relative change in MER versus the baseline.

| Benchmark | Baseline | This Model | Improvement |
|---|---|---|---|
| SEAME | 0.5677 | 0.5301 | -6.6% |
| EMILIA | 0.4470 | 0.4208 | -5.9% |
| CS-Dialogue | 0.3891 | 0.3140 | -19.3% |
### Benchmark Descriptions

- **SEAME**: English-Mandarin code-switching conversational speech from Singapore/Malaysia (9,764 samples)
- **EMILIA**: Synthetic code-switching evaluation set (1,000 samples)
- **CS-Dialogue**: Code-switching dialogue evaluation set (359 samples)
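MER extends WER to mixed-language text: Mandarin is scored per character, English per word, and edit errors are counted over the combined token sequence. A minimal sketch (the per-character CJK / per-word English tokenization below is a common convention and may differ in detail from the evaluation scripts used for these numbers):

```python
import re

def tokenize_mixed(text: str) -> list[str]:
    # One token per CJK character, one per Latin word; punctuation and
    # annotation brackets are dropped. This is one common MER convention.
    return re.findall(r"[\u4e00-\u9fff]|[a-zA-Z']+", text.lower())

def mer(reference: str, hypothesis: str) -> float:
    """Mixed Error Rate: token-level edit distance / reference length."""
    ref, hyp = tokenize_mixed(reference), tokenize_mixed(hypothesis)
    # Single-row Levenshtein edit distance over the token sequences.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)
```

Applied to Example 1 in the Examples section, this sketch reproduces the reported 1.00 → 0.00 scores.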
## Examples

Below are examples showing improvements from the baseline to the DPO-trained model:
### Example 1: Code-Switching Preserved (Lifestyle)

| | Transcription |
|---|---|
| Ground Truth | 我们 都 应该 pursue a healthy lifestyle |
| Baseline | 我们都应该追求健康的生活方式 (fully translated to Chinese) |
| This Model | 我们都应该 pursue a healthy lifestyle |
| MER | 1.00 → 0.00 |
### Example 2: Mixed Language Preserved (Christmas)

| | Transcription |
|---|---|
| Ground Truth | every christmas 我 就 应该 是 没有 人 跟我 庆祝 了 [啦] |
| Baseline | every christmas i would - should be no one to tell me (Chinese translated to English) |
| This Model | every christmas 我就应该是没有人跟我庆祝了啦 |
| MER | 0.88 → 0.00 |
### Example 3: Technical Terms Preserved

| | Transcription |
|---|---|
| Ground Truth | (呃) 每个 lecture different lecturer 那个 notes 不 不怎么 好的 [啦] |
| Baseline | 呃那个老师不同风格的老师 (lost technical terms) |
| This Model | 呃 每个 lecture different lecturer 那个 notes 不不怎么好的啦 |
| MER | 0.75 → 0.00 |
### Example 4: Complex Code-Switching Preserved

| | Transcription |
|---|---|
| Ground Truth | [哦] 还有 什么 好吃 的 吗 还是 你 只是 去 那些 very expensive places like dempsey to eat |
| Baseline | Oh, what else? Oh, yeah, there's always that expensive place like... to eat (lost Chinese content) |
| This Model | 哦 还有什么好吃的吗 还是你只是去那些 very expensive places like dancy to eat |
| MER | 0.83 → 0.04 |
### Example 5: Professional Terms Preserved

| | Transcription |
|---|---|
| Ground Truth | [哦] 因为 是个 professional degree |
| Baseline | 哦因为他有个专业的学位 (translated to Chinese) |
| This Model | 哦 因为 是个 professional degree |
| MER | 1.00 → 0.00 |
## Training Configuration

### Model Architecture

| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen2-Audio-7B-Instruct |
| Adapter Type | LoRA (Low-Rank Adaptation) |
| LoRA Rank (r) | 256 |
| LoRA Alpha | 128 |
| LoRA Dropout | 0.05 |
| LoRA Target Modules | All attention (q_proj, k_proj, v_proj, o_proj) + MLP (up_proj, down_proj, gate_proj) |
| Trainable Parameters | ~1.28B (adapter only) |
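The table above corresponds to a PEFT `LoraConfig` along these lines. This is a sketch for orientation only; the authoritative values ship in `adapter_config.json`, and fields not listed in the table (e.g. `bias`, `task_type`) are assumptions here:

```python
from peft import LoraConfig

# Mirrors the Model Architecture table; see adapter_config.json for the
# exact configuration used in training.
lora_config = LoraConfig(
    r=256,           # LoRA rank
    lora_alpha=128,  # effective scaling: alpha / r = 0.5
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "up_proj", "down_proj", "gate_proj",     # MLP projections
    ],
    bias="none",            # assumed; not stated in the table
    task_type="CAUSAL_LM",  # assumed; not stated in the table
)
```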
### Training Hyperparameters

| Parameter | Value |
|---|---|
| Training Method | DPO (Direct Preference Optimization) |
| DPO Beta (β) | 0.3 |
| DPO Loss | Sigmoid |
| Learning Rate | 3e-5 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Batch Size (per device) | 1 |
| Gradient Accumulation Steps | 4 |
| Global Batch Size | 32 (8 GPUs × 1 × 4) |
| Precision | BF16 |
| Max Sequence Length | 8192 |
| Weight Decay | 0.01 |
| Max Gradient Norm | 1.0 |
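With the sigmoid loss and β = 0.3, each preference pair contributes the standard DPO objective: -log σ(β·[(log π_θ(y_w|x) − log π_ref(y_w|x)) − (log π_θ(y_l|x) − log π_ref(y_l|x))]), where y_w is the preferred (e.g. code-switch-preserving) transcription and y_l the rejected one. A minimal pure-Python sketch with placeholder log-probabilities, not actual model outputs:

```python
import math

BETA = 0.3  # DPO beta from the table above

def dpo_sigmoid_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                     ref_chosen_logp: float, ref_rejected_logp: float,
                     beta: float = BETA) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    logits = chosen_reward - rejected_reward
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-logits))
```

When the policy favors the chosen transcription more strongly than the reference model does, the loss drops below log 2 ≈ 0.693; β scales how hard the policy is pushed away from the reference.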
## Usage

```python
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch
import librosa

# Load the base model and processor
base_model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    trust_remote_code=True,
)

# Attach the LoRA adapter
model = PeftModel.from_pretrained(base_model, "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch")
model.eval()

# Build the chat prompt with an audio turn
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "path/to/audio.wav"},
        {"type": "text", "text": "Please transcribe this speech."},
    ]}
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = [librosa.load("path/to/audio.wav", sr=processor.feature_extractor.sampling_rate)[0]]
inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate and decode only the newly generated tokens (strip the prompt)
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```
## Files

```text
├── README.md                      # This file
├── adapter_config.json            # LoRA configuration
├── adapter_model.safetensors      # LoRA adapter weights (~1.28 GB)
├── tokenizer files                # Tokenizer assets
└── eval_results/
    ├── baseline_seame.json        # Baseline model results on SEAME
    ├── baseline_emilia.json       # Baseline model results on EMILIA
    ├── baseline_cs_dialogue.json  # Baseline model results on CS-Dialogue
    ├── trained_seame.json         # This model's results on SEAME
    ├── trained_emilia.json        # This model's results on EMILIA
    └── trained_cs_dialogue.json   # This model's results on CS-Dialogue
```
## License

This adapter inherits the license of the base Qwen2-Audio-7B-Instruct model.